Databootcamp Final Project: Airbnb Listing Prices and Reviews & COVID-19

Haanbi Kim, Christian Sarkis, Jennifer Zhang

Objective

The objective of this project is to take Airbnb data from NYC, Paris, and Sydney (cities with high levels of housing costs) and analyze 3 different scenarios amongst the listings.

  1. How Price and Superhost Status vary with listing characteristics
  2. How closely the prices of Airbnb listings mirror actual hotel listings
  3. How volume of reviews changed with the onset of COVID-19

Instructions

  1. Create a repository on GitHub and name it: data_bootcamp_final_project
  2. Populate the repository on GitHub with the following items
  3. Upload a .ipynb notebook to NYU classes which contains only the Readme File below and the hyper-link to your repository.

Readme file should contain the following:

This project was completed by insert full name here in partial fulfilment of ECON-UB.0232, Data Bootcamp, Spring 2021. I certify that the NYU Stern Honor Code applies to this project. In particular, I have: Clearly acknowledged the work and efforts of others when submitting written work as our own. The incorporation of the work of others–including but not limited to their ideas, data, creative expression, and direct quotations (which should be designated with quotation marks), or paraphrasing thereof– has been fully and appropriately referenced using notations both in the text and the bibliography. And I understand that: Submitting the same or substantially similar work in multiple courses, either in the same semester or in a different semester, without the express approval of all instructors is strictly forbidden. I acknowledge that a failure to abide by NYU Stern Honor Code will result in a failing grade for the project and course.

And, a one paragraph project description

Description: In this project, we examined the Airbnb listings in New York City, Sydney, and Paris. We compared the types of listings and their prices across the cities as well as with local renting costs. Additionally, we looked at the impact of the Covid-19 pandemic on these cities in relation to each city's lockdown policy.

Grading:
Projects will be graded on their overall quality. This includes, but is not restricted to, these categories:

Link to Google Slides Requirement: https://docs.google.com/presentation/d/1WrobBqffEtvr5vmcedFVKRVN75nZcoheW2GOUoZsgOQ/edit#slide=id.p

Preliminaries

Data is obtained from: http://insideairbnb.com/get-the-data.html
Using listing and review data for the following cities: NYC, Paris, Sydney
Rationale for selected cities: All three are known for high costs of living and housing and are located on different continents

1. Airbnb Listings: Prices and Hosts

In the following section, we analyze several price- and listing-related factors of the dataset, in two parts.

1) Linear regression models predicting price against several listing characteristics
2) Classification models predicting superhost status against several host characteristics

Regressions on Listing Prices with respect to Different Factors

Independent Variables: ['review_scores_rating', 'bedrooms', 'amenities_num']


Dependent Variable: 'price'

In this section, we see how listing prices in New York City, Paris, and Sydney vary with each independent variable in a separate model. We chose these variables because they would intuitively influence the price of a house or apartment, and we want to see whether that logic holds in the data, especially across three cities with some of the highest rents in the world. We expect each variable to be positively correlated with price.

First, we preview the scatterplots for each variable to see whether any adjustments are needed, and we create a new column, amenities_num, that holds the number of amenities per listing.
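
The amenities count could be derived along these lines. This is only a sketch: it assumes the raw amenities field is a JSON-style list stored as a string (as in recent Inside Airbnb exports), and the helper name count_amenities and the sample rows are made up for illustration. The notebook's own code may differ.

```python
import json

def count_amenities(raw):
    """Return the number of amenities in a JSON-style list string.

    Missing or malformed values count as zero amenities (an assumption).
    """
    try:
        items = json.loads(raw)
    except (TypeError, ValueError):
        return 0
    return len(items) if isinstance(items, list) else 0

# Hypothetical sample rows, not real listings
listings = [
    {"id": 1, "amenities": '["Wifi", "Kitchen", "Heating"]'},
    {"id": 2, "amenities": "[]"},
    {"id": 3, "amenities": None},
]

# Add the derived column used later as an independent variable
for row in listings:
    row["amenities_num"] = count_amenities(row["amenities"])
```

With a pandas DataFrame, the same helper could be applied to the amenities column to build amenities_num.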

The scatterplots suggest that a linear regression on raw price may not be the best approach. Because the data points follow a more or less exponential shape, we transform price to its natural log (i.e. ln('price')) and fit the linear regression models on that, which may give a better fit.
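
To illustrate why the log transform can help, here is a minimal sketch of a one-variable least-squares fit written with the standard library only. The function fit_line and the synthetic, exactly exponential prices are assumptions for illustration; they are not the notebook's functions or data.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Synthetic data: price grows exponentially with the predictor
bedrooms = [1, 2, 3, 4, 5]
price = [50 * math.exp(0.4 * b) for b in bedrooms]
lnprice = [math.log(p) for p in price]

slope, intercept, r2_log = fit_line(bedrooms, lnprice)  # fit on ln(price)
_, _, r2_raw = fit_line(bedrooms, price)                # fit on raw price
```

On data shaped like this, the fit on ln(price) recovers the growth rate as the slope and has a higher R-squared than the fit on raw price. Real listings are much noisier, so the improvement there will be smaller.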

Functions

Model 1: Review Scores Rating - Price (log)

Explanation and Flaws:
The model shows a high R-squared value: it explains 93% of the variation in the lnprice variable, so the fit is generally good. For rating scores above 40 in particular, the model predicts listing prices (in natural log) quite well. One flaw is that the model does not capture prices for listings with ratings below 40, as seen in the graph above.

Model 2: Bedrooms - Price (log)

Explanation and Flaws:

The model shows a lower R-squared value than the first regression model. The trendline does not fit the data points very accurately, and the fit gets worse for listings with more than 10 bedrooms. Since the current plot does not make the reason for the poorer fit clear, we rescale the bedrooms axis by taking the log of the bedrooms column so that the differences are more stable. This helps us focus on the more concentrated data points (i.e. listings with fewer than 10 bedrooms).

Additional Explanation and Flaws:
In this model, we get a better view of the bulk of the data. The data fall into (generally) distinct categories based on the number of bedrooms. The trendline fits the data a little better after the adjustment, but bedroom count is a discrete variable, and most listings (regardless of price) have few bedrooms. Prices for listings with fewer bedrooms are therefore more dispersed, and the model has difficulty predicting accurately in those sections.
TL;DR - The variance for listings where lnbedrooms = 0 (1 bedroom) is larger than the variance for listings where lnbedrooms = 2, so it is easier to predict the latter than the former, which explains the poorer fit.
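
The dispersion argument can be checked by computing the variance of lnprice within each bedroom group. The sketch below uses made-up (bedrooms, lnprice) pairs purely to show the calculation; the numbers are not taken from the dataset.

```python
from collections import defaultdict
from statistics import pvariance

# Hypothetical (bedrooms, lnprice) pairs for illustration only
rows = [
    (1, 4.0), (1, 4.8), (1, 5.6), (1, 3.6),
    (3, 5.4), (3, 5.6), (3, 5.5),
]

# Group lnprice values by bedroom count
groups = defaultdict(list)
for bedrooms, lnprice in rows:
    groups[bedrooms].append(lnprice)

# Population variance of lnprice within each group
variance_by_bedrooms = {b: pvariance(values) for b, values in groups.items()}
```

A group with a larger variance is harder for a single trendline to predict, which is the pattern described above for 1-bedroom listings.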

Model 3: Number of Amenities - Price (log)

Explanation and Flaws:
The R-squared for this model is the lowest of the three models, with the regression explaining only 4% of the variation in the lnprice variable. As seen in the visualization of the regression model, the data points are widely dispersed, more so than for the bedrooms variable. As a result, it is difficult for a linear regression model to capture all of the variation. What the model does show is an overall trend, which is also what we are looking for. The number of amenities at an Airbnb stay generally correlates with a moderate increase in price, which is in line with our initial hypothesis.

Conclusion:
Given the three regression models above, our intuition about what raises the price of a listing is only partly in line with the results. Listing prices show a higher correlation and better fit with review scores, and a lower correlation and fit with the number of bedrooms and amenities. This hints that listing prices may depend more on the reviews left by previous guests than on the physical qualities of the listing.

Classification on Superhosts and Non-Superhosts

In this section, we will be creating a classification model to see how much influence the number of reviews, review ratings, and response rates separately have on whether a host is a superhost or not.
Independent variables are chosen based on the intuitive correlation one would draw between each candidate variable and superhost status.
We first attempt classification with a Logistic Regression model and make adjustments later if we see any issues with it.

Independent Variables Used: ['number_of_reviews', 'host_response_rate']
Dependent Variable: 'host_is_superhost'

Functions

Classification 1: Number of Reviews - Host is Superhost

Above are some of the summary statistics from this classification model, in order of specificity. At the top is the model's score for the testing and total data (for a classifier, the share of observations in the host_is_superhost column that are predicted correctly). After that is a confusion matrix, which shows specifically how the model fares (top left: True Negative, top right: False Positive, bottom left: False Negative, bottom right: True Positive).

Finally, at the bottom is a classification report. The scores for "precision," "recall," and "f1-score" represent the accuracy of predictions from the model for each potential outcome (0 and 1). For simplicity of interpretation, the closer these values are to 1, the better. As seen in the report above, predictions for 1 (observations where the host is a superhost) are less accurate; this will come into play later.
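
For readers who want to see how these numbers relate to each other, here is a small standard-library sketch that builds the confusion matrix in the layout described above and derives precision, recall, and f1-score per class. The labels are hypothetical, and the helper names confusion_matrix and class_report are illustrative stand-ins, not the functions used in the notebook.

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1  # row = actual, column = predicted
    return matrix

def class_report(matrix, label):
    """Return (precision, recall, f1) for one class (label is 0 or 1)."""
    other = 1 - label
    correct = matrix[label][label]
    predicted_as_label = correct + matrix[other][label]
    actually_label = correct + matrix[label][other]
    precision = correct / predicted_as_label if predicted_as_label else 0.0
    recall = correct / actually_label if actually_label else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels: 1 = superhost, 0 = not a superhost
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

matrix = confusion_matrix(y_true, y_pred)
accuracy = (matrix[0][0] + matrix[1][1]) / len(y_true)
report_not_superhost = class_report(matrix, 0)
report_superhost = class_report(matrix, 1)
```

In this toy example the overall accuracy is high while precision and recall for class 1 are much lower than for class 0, the same kind of gap the report above shows.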

The following is a representation of a better confusion matrix visualization based on the proportion of responses.
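
The proportions behind such a visualization can be computed by dividing each cell by the total number of predictions. This is a sketch with made-up counts; as_proportions is a hypothetical helper name.

```python
def as_proportions(matrix):
    """Convert confusion-matrix counts into shares of all predictions."""
    total = sum(sum(row) for row in matrix)
    return [[cell / total for cell in row] for row in matrix]

# Hypothetical counts in the layout [[TN, FP], [FN, TP]]
counts = [[60, 20], [15, 5]]
shares = as_proportions(counts)
```

Each entry of shares is the fraction of all observations that falls in that cell, so the four entries sum to 1.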

Explanation and flaws:
Given the information above, the model does a good job of identifying hosts who are not superhosts: correct non-superhost predictions make up 85% of its total predictions. It does a poor job of identifying superhosts, though: correct superhost predictions make up only 1% of its total predictions, even though superhost observations make up 14% of the total data. In this sense, its high score may be misleading, since the model predominantly favors one part of the data.

Classification 2: Host Response Rate - Host is Superhost

Logistic regression produces a poor classifier here, since every observation is predicted as the same value. Let's look for a better classification model.

That looks a bit better, as there is a large number of true positives, but there are also more false positives and fewer true negatives than before (a tradeoff, in a way). Since this model changes with the value of n_neighbors, we will see how the score of the model changes with different values of n_neighbors used to predict superhost status.

The score that is shown in the plots is misleading, as we get a similar problem as in the Classification 1 section where there is a smaller number of true positives (bottom right). Since the default value for n_neighbors brought better results in terms of predicting true positives, we will create a confusion matrix and classification report based on the default value for n_neighbors (5).
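
The idea of sweeping n_neighbors while also watching true positives can be sketched with a tiny one-feature nearest-neighbors classifier. This is a standard-library stand-in written for illustration, not the classifier used in the notebook, and the response rates and labels below are invented.

```python
from collections import Counter

def knn_predict(train_x, train_y, x, k):
    """Predict a label by majority vote among the k nearest training points."""
    pairs = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))
    votes = Counter(label for _, label in pairs[:k])
    return votes.most_common(1)[0][0]

# Hypothetical data: host response rate (0-100) and superhost flag
train_x = [50, 60, 70, 80, 90, 95, 100, 100, 100, 100]
train_y = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]
test_x = [65, 92, 100]
test_y = [0, 1, 1]

def evaluate(k):
    """Return (accuracy, true_positives) on the test points for a given k."""
    predictions = [knn_predict(train_x, train_y, x, k) for x in test_x]
    accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
    true_positives = sum(p == 1 and t == 1 for p, t in zip(predictions, test_y))
    return accuracy, true_positives

# Odd values of k avoid tied votes with two classes
results = {k: evaluate(k) for k in (1, 3, 5)}
```

Reporting true positives next to the score makes it easier to spot a value of k that scores well only by predicting the majority class.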

Explanation and flaws:
In this set of classification models, similar to Classification 1, the rate of correctly predicting hosts who aren't superhosts is higher than the rate of correctly predicting hosts who are. This suggests that these two independent variables are better at predicting that a host is not a superhost than at predicting that a host is one.

Conclusion:
Overall, the classification models predict one part of the data better than the other: both models predict that a host is not a superhost with greater accuracy, while predictions that a host is a superhost are less accurate. Visualizing the data points offers a possible explanation. As seen below, there is considerable overlap: a host can be a superhost or not regardless of the number of reviews or the response rate. In other words, at 200 reviews, there are many instances where a host is a superhost and many where a host is not. This may explain why the classification models perform poorly when predicting the observations that are superhosts.

2. Airbnb Listings vs. Hotel Prices

Airbnb advertises itself as a cost-effective alternative to traditional hotels. To test this, we compared the prices of 2-bed private room Airbnb listings to the current average double hotel room rate (Trivago Hotel Price Index) in each city. We chose 2-bed private room listings to control for some of the many differences between hotels and Airbnbs.
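
One way to express such a comparison is a relative "deal rating". The exact formula used in the notebook is not shown, so the definition below (the hotel rate minus the Airbnb price, divided by the hotel rate) is an assumption, and the hotel rate is a placeholder rather than a real Trivago figure.

```python
def deal_rating(airbnb_price, hotel_rate):
    """Relative saving versus the average hotel rate (assumed definition).

    Positive means the listing is cheaper than the hotel average,
    negative means it is more expensive.
    """
    return (hotel_rate - airbnb_price) / hotel_rate

# Placeholder average hotel rate and hypothetical nightly listing prices
hotel_rate = 100.0
listing_prices = [80.0, 100.0, 150.0]

ratings = [deal_rating(price, hotel_rate) for price in listing_prices]
positive_deals = [p for p, r in zip(listing_prices, ratings) if r > 0]
```

Under this definition, a rating of 0 means the listing costs the same as the average hotel room, and only listings with a positive rating count as deals.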

Our results here are interesting. Firstly, outside of a few outliers, there do not seem to be major differences in the distributions of deal ratings between cities (with the exception of Sydney having more extreme negative deals, possibly due to the availability of beach homes).

Zooming into the graph reveals that a large number of Airbnb rentals are actually more expensive than the average hotel rate for their city. This is surprising, considering that a hotel will usually have more amenities than a simple private room on Airbnb.

There are also clearly some deals among the Airbnb listings in these cities, even if they are outnumbered by the bad ones. How do we go about finding these deals? Let's try to find out by mapping our positive deals and seeing if we can notice any trends.

Explanations:
Zooming into any of our three cities reveals a consistent trend: better deals relative to the average are usually found farther from central city locations.

For example, in NYC, we see much higher deal scores in the other boroughs than in Manhattan; in Paris, we see more deals farther away from the Seine River; and in Sydney, we see better deals as we move farther away from the coastline.

While it could be argued that hotel rates in these areas may also be generally lower, reducing the significance of the deal, the fact still stands that the best Airbnb deals for these cities are located in non-traditional vacation areas. One explanation may be that real estate is cheaper in these areas, paving the way for more amateur hosts who are willing to bargain more on their prices.

Ultimately there are deals over hotels on Airbnb in these major cities, but due diligence must be taken to make sure that you are actually getting a deal.

3. Airbnb Listings in Relation to COVID-19 Lockdown Policies

When looking at available data for Airbnb, we were unable to find detailed information covering prices going back further than April 2020. Older data is available only to those who make a special request and pay for access. So, in order to observe Airbnb's performance in Paris, Sydney, and New York City, we opted to look at the volume of comments. While not every renter will leave a comment, there is a direct relationship between the reviews left and the number of people using Airbnb in the given cities.

Given that relationship, the change in the number of reviews is a more accurate way to study increases and decreases in Airbnb's customers than the raw number of reviews. Consequently, we chose to look at the changes in the number of reviews for each city throughout the pandemic in relation to the governments' policy changes.
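
The monthly change could be computed roughly as follows. This sketch assumes each review carries a date in YYYY-MM-DD form (as in the Inside Airbnb review files) and uses a handful of invented dates; the variable names are illustrative.

```python
from collections import Counter

# Hypothetical review dates for one city
review_dates = [
    "2020-01-05", "2020-01-20", "2020-01-28", "2020-01-30",
    "2020-02-11", "2020-02-14",
    "2020-03-02",
]

# Count reviews per calendar month (the first 7 characters are YYYY-MM)
monthly = Counter(date[:7] for date in review_dates)
months = sorted(monthly)

# Percent change from each month to the next.
# Note: a month with no reviews at all would be missing from `monthly`,
# so real data may need the month range filled in first.
pct_change = {
    current: (monthly[current] - monthly[previous]) / monthly[previous] * 100
    for previous, current in zip(months, months[1:])
}
```

Plotting pct_change (or the monthly counts themselves) against the policy dates listed below is one way to line up review volume with each city's restrictions.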

New York City

  1. March 2020 - "New York State on PAUSE" order begins, requiring all non-essential workers to stay home
  2. June 2020 - NYC begins reopening (phase 1 and phase 2)
  3. November 2020 - new restrictions are reinstated (curfews, indoor dining restrictions, limited gatherings in private homes)

Paris

  1. March 2020 - strict lockdown imposed (no non-essential movement allowed and need travel pass to move)
  2. June 2020 - easing of restrictions; travel to other EU countries with open borders is allowed
  3. November 2020 - 2nd strict lockdown with additional restrictions in Paris

Sydney

  1. March 2020 - Borders close to non-residents and residents are required to quarantine
  2. May 2020 - Easing of lockdowns across Australia
  3. December 2020 - Northern part of Sydney declared a hotspot and restrictions increase

Conclusion

Governments in all 3 cities implemented policies of various degrees in March, following the spread of Covid-19 outside of Asia. Paris, which had the strictest lockdown policy of the three cities, restricted the movement of individuals coming into the city and also within the city. Consequently, the graphs show that Paris saw the greatest drop in guests at Airbnbs. On the other hand, New York City, which did not close its borders and did not legally restrict travel in the city, did not see as large a percent decrease. Around June for Paris and New York City and May for Sydney, Covid-19 numbers seemed to be improving and local governments opted to ease quarantine and social distancing policies. All three cities saw an increase in guests, but nowhere near pre-Covid levels. Towards the end of 2020, Covid-19 cases were on the rise again across the globe. In response to fears of a 2nd wave, the local governments of the 3 cities re-implemented restrictions. Interestingly, the usage of Airbnb and travel only decreased in Sydney and Paris: New York City was seemingly unaffected. This could be due to a variety of reasons, such as a difference in attitudes toward Covid-19 or conflicting information from the state and federal governments.

Note: While we believe that the number of reviews left per month reflects the number of guests using Airbnb, it is important to note that reviews can be posted some time after a stay.